Featurization, Model Selection, and Tuning Project¶
Problem Statement¶
Context
- Thera Bank recently saw a steep decline in the number of credit card users. Credit cards are a good source of income for banks because of the various fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others apply only under specified circumstances.
Objective
- Customers leaving its credit card services would cause the bank a loss, so the bank wants to analyze customer data to identify the customers likely to leave and the reasons why, so that it can improve in those areas.
As a Data Scientist at Thera Bank, you need to explore the data provided, identify patterns, build a classification model to identify customers likely to churn, and provide actionable insights and recommendations that will help the bank improve its services so that customers do not give up their credit cards.
Let us start by importing the required libraries.¶
In [5]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score, confusion_matrix
import warnings
warnings.filterwarnings('ignore')
In [6]:
#Load the dataset
df = pd.read_csv('BankChurners.csv')
In [7]:
df.sample(5)
Out[7]:
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9272 | 712256358 | Existing Customer | 40 | M | 3 | Graduate | Married | $60K - $80K | Blue | 23 | ... | 3 | 1 | 22729.0 | 1581 | 21148.0 | 1.025 | 13672 | 110 | 0.746 | 0.070 |
| 4216 | 715093383 | Existing Customer | 49 | F | 2 | College | Married | Less than $40K | Blue | 41 | ... | 1 | 3 | 2115.0 | 1598 | 517.0 | 0.934 | 4450 | 68 | 0.545 | 0.756 |
| 5269 | 789097833 | Existing Customer | 39 | F | 0 | High School | Single | Less than $40K | Blue | 27 | ... | 3 | 3 | 3927.0 | 720 | 3207.0 | 0.630 | 4230 | 93 | 0.898 | 0.183 |
| 3957 | 717483558 | Existing Customer | 42 | M | 3 | High School | Married | $120K + | Blue | 36 | ... | 1 | 2 | 19156.0 | 1353 | 17803.0 | 1.410 | 3547 | 61 | 0.649 | 0.071 |
| 9973 | 721005108 | Existing Customer | 41 | M | 0 | College | Married | $40K - $60K | Blue | 29 | ... | 2 | 2 | 6991.0 | 1384 | 5607.0 | 0.660 | 14994 | 104 | 0.733 | 0.198 |
5 rows × 21 columns
Exploratory Data Analysis¶
In [9]:
#get the information about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
Observations
- No.of columns = 21
- No.of data points = 10,127
- Different data types in the dataset: float64(5), int64(10), object(6)
In [11]:
# Check for missing values
df.isnull().sum()
Out[11]:
CLIENTNUM                      0
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
Observations:
- Education_Level: 1519 missing values (approximately 15% of the dataset)
- Marital_Status: 749 missing values (approximately 7.4% of the dataset)
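The missing-value percentages quoted above can be computed directly from the null counts. A minimal sketch on a toy frame (the NaNs here stand in for the real missing Education_Level entries):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; two of four Education_Level values are missing
toy = pd.DataFrame({
    'Education_Level': ['Graduate', np.nan, 'College', np.nan],
    'Customer_Age': [40, 49, 39, 42],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().sum() / len(toy) * 100
print(missing_pct)
```

On the full dataset the same two lines reproduce the ~15% and ~7.4% figures for Education_Level and Marital_Status.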
In [13]:
df.describe().T
Out[13]:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.0 | 7.391776e+08 | 3.690378e+07 | 708082083.0 | 7.130368e+08 | 7.179264e+08 | 7.731435e+08 | 8.283431e+08 |
| Customer_Age | 10127.0 | 4.632596e+01 | 8.016814e+00 | 26.0 | 4.100000e+01 | 4.600000e+01 | 5.200000e+01 | 7.300000e+01 |
| Dependent_count | 10127.0 | 2.346203e+00 | 1.298908e+00 | 0.0 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 |
| Months_on_book | 10127.0 | 3.592841e+01 | 7.986416e+00 | 13.0 | 3.100000e+01 | 3.600000e+01 | 4.000000e+01 | 5.600000e+01 |
| Total_Relationship_Count | 10127.0 | 3.812580e+00 | 1.554408e+00 | 1.0 | 3.000000e+00 | 4.000000e+00 | 5.000000e+00 | 6.000000e+00 |
| Months_Inactive_12_mon | 10127.0 | 2.341167e+00 | 1.010622e+00 | 0.0 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 6.000000e+00 |
| Contacts_Count_12_mon | 10127.0 | 2.455317e+00 | 1.106225e+00 | 0.0 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 6.000000e+00 |
| Credit_Limit | 10127.0 | 8.631954e+03 | 9.088777e+03 | 1438.3 | 2.555000e+03 | 4.549000e+03 | 1.106750e+04 | 3.451600e+04 |
| Total_Revolving_Bal | 10127.0 | 1.162814e+03 | 8.149873e+02 | 0.0 | 3.590000e+02 | 1.276000e+03 | 1.784000e+03 | 2.517000e+03 |
| Avg_Open_To_Buy | 10127.0 | 7.469140e+03 | 9.090685e+03 | 3.0 | 1.324500e+03 | 3.474000e+03 | 9.859000e+03 | 3.451600e+04 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 7.599407e-01 | 2.192068e-01 | 0.0 | 6.310000e-01 | 7.360000e-01 | 8.590000e-01 | 3.397000e+00 |
| Total_Trans_Amt | 10127.0 | 4.404086e+03 | 3.397129e+03 | 510.0 | 2.155500e+03 | 3.899000e+03 | 4.741000e+03 | 1.848400e+04 |
| Total_Trans_Ct | 10127.0 | 6.485869e+01 | 2.347257e+01 | 10.0 | 4.500000e+01 | 6.700000e+01 | 8.100000e+01 | 1.390000e+02 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 7.122224e-01 | 2.380861e-01 | 0.0 | 5.820000e-01 | 7.020000e-01 | 8.180000e-01 | 3.714000e+00 |
| Avg_Utilization_Ratio | 10127.0 | 2.748936e-01 | 2.756915e-01 | 0.0 | 2.300000e-02 | 1.760000e-01 | 5.030000e-01 | 9.990000e-01 |
In [14]:
# Check for duplicate rows
df.duplicated().sum()
Out[14]:
0
Observations:
- There are no duplicate rows.
Handling Missing Values¶
- It is better to handle the missing values before we proceed with the univariate and bivariate analysis to avoid biased analysis or visualization issues.
In [18]:
#let us see the different values in education level and marital status columns
df['Education_Level'].value_counts()
Out[18]:
Education_Level
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: count, dtype: int64
In [19]:
df['Marital_Status'].value_counts()
Out[19]:
Marital_Status
Married     4687
Single      3943
Divorced     748
Name: count, dtype: int64
As both columns have a substantial number of missing values, dropping these rows is not an appropriate choice, since we might lose important information that could help the models identify the underlying patterns.¶
So, it is better to create a separate 'Unknown' category rather than imputing with the mode.¶
In [22]:
df['Education_Level'] = df['Education_Level'].fillna('Unknown')
df['Marital_Status'] = df['Marital_Status'].fillna('Unknown')
In [23]:
df.sample(10)
Out[23]:
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9839 | 721145433 | Existing Customer | 52 | F | 5 | High School | Single | Less than $40K | Silver | 36 | ... | 2 | 2 | 11280.0 | 1523 | 9757.0 | 0.765 | 16237 | 128 | 0.753 | 0.135 |
| 2312 | 825756333 | Existing Customer | 55 | F | 0 | Uneducated | Married | Less than $40K | Blue | 51 | ... | 3 | 1 | 2041.0 | 1885 | 156.0 | 0.587 | 2118 | 48 | 0.778 | 0.924 |
| 7010 | 790108908 | Existing Customer | 44 | M | 2 | Post-Graduate | Married | $80K - $120K | Blue | 37 | ... | 1 | 1 | 34058.0 | 0 | 34058.0 | 0.552 | 4043 | 57 | 0.781 | 0.000 |
| 2448 | 709616958 | Existing Customer | 38 | M | 2 | High School | Single | $60K - $80K | Silver | 30 | ... | 3 | 4 | 26556.0 | 1583 | 24973.0 | 0.694 | 3168 | 73 | 0.738 | 0.060 |
| 2535 | 716378658 | Existing Customer | 55 | F | 2 | Unknown | Married | Less than $40K | Blue | 36 | ... | 2 | 3 | 2629.0 | 1476 | 1153.0 | 0.537 | 1725 | 41 | 0.414 | 0.561 |
| 9047 | 718353633 | Attrited Customer | 50 | M | 1 | Post-Graduate | Unknown | $80K - $120K | Gold | 36 | ... | 3 | 2 | 34516.0 | 0 | 34516.0 | 1.032 | 5547 | 75 | 0.744 | 0.000 |
| 109 | 708476808 | Existing Customer | 54 | M | 4 | Unknown | Divorced | $120K + | Blue | 36 | ... | 3 | 2 | 33791.0 | 1960 | 31831.0 | 0.618 | 1047 | 31 | 0.824 | 0.058 |
| 9780 | 708694983 | Existing Customer | 55 | F | 1 | Uneducated | Unknown | Less than $40K | Blue | 36 | ... | 2 | 1 | 3150.0 | 2222 | 928.0 | 0.794 | 14750 | 96 | 0.655 | 0.705 |
| 1331 | 717268458 | Existing Customer | 61 | M | 0 | Graduate | Married | $60K - $80K | Blue | 49 | ... | 3 | 4 | 3352.0 | 1713 | 1639.0 | 0.657 | 1952 | 50 | 0.667 | 0.511 |
| 1931 | 780998133 | Existing Customer | 43 | M | 3 | Graduate | Married | $80K - $120K | Blue | 34 | ... | 4 | 4 | 10540.0 | 1489 | 9051.0 | 1.029 | 1980 | 45 | 0.607 | 0.141 |
10 rows × 21 columns
In [24]:
#Extracting just categorical columns
Categorical_cols = df.select_dtypes(include = 'object')
In [25]:
categorical_columns = list(Categorical_cols.columns)
In [26]:
categorical_columns
Out[26]:
['Attrition_Flag', 'Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']
In [27]:
for i in categorical_columns:
    print(df[i].value_counts())
    print('-'*50)
Attrition_Flag
Existing Customer    8500
Attrited Customer    1627
Name: count, dtype: int64
--------------------------------------------------
Gender
F    5358
M    4769
Name: count, dtype: int64
--------------------------------------------------
Education_Level
Graduate         3128
High School      2013
Unknown          1519
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: count, dtype: int64
--------------------------------------------------
Marital_Status
Married     4687
Single      3943
Unknown      749
Divorced     748
Name: count, dtype: int64
--------------------------------------------------
Income_Category
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: count, dtype: int64
--------------------------------------------------
Card_Category
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: count, dtype: int64
--------------------------------------------------
Observations:
- The category 'abc' in the Income_Category column seems suspicious and could be an invalid or placeholder value. It’s crucial to address it to avoid impacting the analysis or model performance.
- Let us replace 'abc' with the mode (Less than $40K), as it is the most frequent and likely represents the majority income level.
In [29]:
df['Income_Category'] = df['Income_Category'].replace('abc','Less than $40K')
In [30]:
#Let us verify if the replacement is done correctly.
df['Income_Category'].value_counts()
Out[30]:
Income_Category
Less than $40K    4673
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
$120K +            727
Name: count, dtype: int64
In [31]:
df['CLIENTNUM'].value_counts()
Out[31]:
CLIENTNUM
768805383 1
711784908 1
720133908 1
803197833 1
812222208 1
..
796903083 1
716014983 1
712521783 1
780045708 1
714337233 1
Name: count, Length: 10127, dtype: int64
In [32]:
# CLIENTNUM is a unique identifier and does not add any information to our analysis, so it is better to drop this column.
df.drop(columns=['CLIENTNUM'], inplace=True)
In [33]:
df.sample(5)
Out[33]:
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5593 | Existing Customer | 53 | F | 3 | Graduate | Married | Less than $40K | Blue | 35 | 4 | 3 | 1 | 3390.0 | 2180 | 1210.0 | 0.877 | 4585 | 78 | 0.950 | 0.643 |
| 1010 | Existing Customer | 43 | F | 2 | Uneducated | Single | $40K - $60K | Blue | 29 | 6 | 3 | 2 | 2162.0 | 1377 | 785.0 | 0.600 | 1555 | 29 | 0.706 | 0.637 |
| 3451 | Existing Customer | 48 | F | 2 | High School | Single | Less than $40K | Blue | 36 | 3 | 3 | 3 | 5483.0 | 535 | 4948.0 | 0.681 | 2051 | 49 | 0.633 | 0.098 |
| 1978 | Existing Customer | 38 | F | 1 | Graduate | Married | $40K - $60K | Blue | 20 | 5 | 1 | 3 | 5153.0 | 2517 | 2636.0 | 1.488 | 3095 | 63 | 0.969 | 0.488 |
| 871 | Existing Customer | 44 | M | 3 | Unknown | Married | $60K - $80K | Blue | 37 | 3 | 2 | 3 | 6224.0 | 0 | 6224.0 | 0.738 | 1463 | 34 | 0.889 | 0.000 |
The data is now clean and ready for analysis.¶
Univariate Analysis¶
In [36]:
# Let us first analyze numerical columns
# Select numerical columns for univariate analysis
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns
# Plot histograms for numerical columns
for column in numerical_columns:
    plt.figure(figsize=(8, 4))
    plt.hist(df[column], bins=30, color='orange', alpha=0.7, edgecolor='black')
    plt.title(f"Distribution of {column}")
    plt.xlabel(column)
    plt.ylabel("Frequency")
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()
Observation:
- Customer_Age: The distribution is approximately normal, with most customers aged between 40 and 50.
- Credit_Limit: Highly skewed, with the majority of customers having a credit limit below 10,000.
- Avg_Open_To_Buy: Similar to Credit_Limit, most values are concentrated at lower levels.
- Total_Trans_Ct: The data appears bimodal, with two peaks around 60–70 and 70–90 transactions.
- Total_Trans_Amt: Skewed, with a significant concentration of customers at lower transaction amounts.
- Avg_Utilization_Ratio: Skewed, with most values concentrated near zero, indicating low credit usage.
- Months_on_book: Normal distribution centered around 35–40 months.
- Months_Inactive_12_mon: Peaks at 2–3 months, suggesting many customers were inactive for only a short period.
- Total_Amt_Chng_Q4_Q1: Mostly centered around 0.5–1.0, showing little variation in transaction amounts between the two quarters.
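The skew visible in the histograms can be quantified numerically with `Series.skew()`. A minimal sketch on hypothetical values standing in for a right-skewed column such as Credit_Limit (a positive result confirms a long right tail):

```python
import pandas as pd

# Hypothetical values imitating a right-skewed credit-limit distribution
s = pd.Series([1500, 2000, 2500, 3000, 4000, 5000, 25000, 34000])

skewness = s.skew()
print(skewness)  # clearly positive: right-skewed
```

In the notebook, `df[numerical_columns].skew()` would rank all numeric features by skew at once, which helps decide which ones might benefit from a transformation before modeling.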
In [38]:
fig, axes = plt.subplots(len(categorical_columns), 1, figsize=(10, len(categorical_columns) * 5))
fig.tight_layout(pad=5.0)
for i, col in enumerate(categorical_columns):
    sns.countplot(data=df, x=col, order=df[col].value_counts().index, ax=axes[i])
    axes[i].set_title(f"Count of {col}")
    axes[i].set_xlabel(col)
    axes[i].set_ylabel("Frequency")
    for p in axes[i].patches:  # Adding labels on bars
        axes[i].annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                         ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                         textcoords='offset points')
plt.show()
Observations:
Attrition_Flag:
- Most customers are existing customers (around 84%).
- Only about 16% are attrited customers.
Gender:
- The distribution is slightly skewed, with more female customers (53%) compared to males (47%).
Education_Level:
- Most customers are graduates, followed by high school education.
- A significant number of customers are categorized as "Uneducated" or "Unknown."
Marital_Status:
- A majority of customers are married, followed by single customers.
- Divorced customers represent a smaller proportion.
Income_Category:
- Most customers fall into the "Less than $40K" category, followed by the "$40K - $60K" income range.
- Very few customers have incomes above $120K.
Card_Category:
- The majority of customers hold "Blue" cards.
- Other card categories like "Silver," "Gold," and "Platinum" are relatively rare.
In [40]:
df['Attrition_Flag'].value_counts()
Out[40]:
Attrition_Flag
Existing Customer    8500
Attrited Customer    1627
Name: count, dtype: int64
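The class imbalance is easier to judge as percentages; `value_counts(normalize=True)` gives the shares directly. A sketch built from the counts shown above:

```python
import pandas as pd

# Reconstruct the target column from the counts in the output above
flags = pd.Series(['Existing Customer'] * 8500 + ['Attrited Customer'] * 1627)

pct = flags.value_counts(normalize=True) * 100
print(pct.round(1))  # roughly 84% existing vs 16% attrited
```

This roughly 84/16 split means accuracy alone would be a misleading metric later; recall or ROC-AUC on the attrited class will matter more.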
Bivariate Analysis¶
In [42]:
# Numerical Variables vs Attrition Flag Column
# Bivariate Analysis: Relationship between numerical variables and 'Attrition_Flag'
# Plotting boxplots to show distributions of numerical features across attrition categories
fig, axes = plt.subplots(len(numerical_columns), 1, figsize=(10, len(numerical_columns) * 6))
fig.tight_layout(pad=5.0)
for i, col in enumerate(numerical_columns):
    sns.boxplot(data=df, x='Attrition_Flag', y=col, ax=axes[i])
    axes[i].set_title(f"{col} vs Attrition_Flag")
    axes[i].set_xlabel("Attrition_Flag")
    axes[i].set_ylabel(col)
plt.show()
Observations:
- Customer_Age: Attrited customers tend to be slightly older than existing customers.
- Dependent_count: 75% of attrited customers have a dependent count above 2, while only 50% of existing customers do.
- Months_on_book: Attrited customers seem to have been with the bank for a shorter period compared to existing customers.
- Total_Relationship_Count: Existing customers have a slightly higher relationship count with the bank compared to attrited customers.
- Months_Inactive_12_mon: Attrited customers show higher inactivity in the last 12 months.
- Contacts_Count_12_mon: Attrited customers have slightly higher contact counts, which might indicate more issues or complaints.
- Credit_Limit: Attrited customers generally have lower credit limits.
- Total_Revolving_Bal: Attrited customers show higher revolving balances, indicating they carry forward more unpaid balances.
- Avg_Open_To_Buy: Existing customers have more available credit compared to attrited customers.
- Total_Amt_Chng_Q4_Q1: No significant difference in spending changes between the two categories.
- Total_Trans_Amt: Existing customers have significantly higher total transaction amounts.
- Total_Trans_Ct: Existing customers also show higher transaction counts.
- Total_Ct_Chng_Q4_Q1: No significant difference in transaction count changes.
- Avg_Utilization_Ratio: Attrited customers have higher utilization ratios, indicating they are using more of their available credit.
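The boxplot observations above can be backed with group medians via `groupby`. A minimal sketch on toy rows (the values are illustrative, not from the real dataset):

```python
import pandas as pd

# Toy rows standing in for df; illustrative values only
toy = pd.DataFrame({
    'Attrition_Flag': ['Existing', 'Existing', 'Attrited', 'Attrited'],
    'Total_Trans_Ct': [80, 70, 40, 45],
})

# Median transaction count per attrition group
medians = toy.groupby('Attrition_Flag')['Total_Trans_Ct'].median()
print(medians)
```

On the full dataset, `df.groupby('Attrition_Flag')[numerical_columns].median()` summarizes every numeric feature at once and makes the differences described above precise.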
In [44]:
#Categorical variables vs Attrition Flag
fig, axes = plt.subplots(len(categorical_columns), 1, figsize=(12, len(categorical_columns) * 4))
fig.tight_layout(pad=5.0)
for i, col in enumerate(categorical_columns):
    sns.countplot(data=df, x=col, hue='Attrition_Flag', ax=axes[i], order=df[col].value_counts().index)
    axes[i].set_title(f"{col} vs Attrition_Flag")
    axes[i].set_xlabel(col)
    axes[i].set_ylabel("Count")
    for p in axes[i].patches:  # Adding labels on bars
        axes[i].annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                         ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                         textcoords='offset points')
plt.show()
Observations:
Gender:
- Both genders show a similar proportion of attrition, but female customers have a slightly higher attrition count.
Education_Level:
- Graduates have the highest attrition count, possibly because they form the majority of the dataset.
- Customers with "Unknown" education levels also show a noticeable attrition rate.
Marital_Status:
- Married customers have slightly higher attrition counts compared to single customers.
- Divorced customers have the least attrition count, but their overall representation is also low.
Income_Category:
- Customers earning less than $40K experience higher attrition, likely because they form the majority of the dataset.
- Attrition is relatively lower for the higher income brackets ($120K+).
Card_Category:
- The majority of attrition occurs among customers holding "Blue" cards, as this category represents the majority.
- Attrition is significantly lower for "Silver," "Gold," and "Platinum" cardholders.
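Since several of the observations above note that raw attrition counts track category size, a row-normalized crosstab gives the attrition *rate* within each category, which is the more informative comparison. A sketch on toy rows (illustrative values only):

```python
import pandas as pd

# Toy rows standing in for df; illustrative values only
toy = pd.DataFrame({
    'Card_Category': ['Blue', 'Blue', 'Blue', 'Silver'],
    'Attrition_Flag': ['Existing Customer', 'Attrited Customer',
                       'Existing Customer', 'Existing Customer'],
})

# normalize='index' converts each row of counts into within-category proportions
rates = pd.crosstab(toy['Card_Category'], toy['Attrition_Flag'], normalize='index')
print(rates)
```

Running `pd.crosstab(df[col], df['Attrition_Flag'], normalize='index')` for each categorical column would show whether, say, Blue cardholders actually churn at a higher *rate* or simply dominate the counts.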
In [46]:
# Compute correlation matrix
numerical_data = df.select_dtypes(include=['int64', 'float64'])
correlation_matrix = numerical_data.corr()
# Plot heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title("Heatmap of Numerical Relationships")
plt.show()
Observations:
Strong Positive Correlations:
- Credit_Limit and Avg_Open_To_Buy (1.00): This indicates that customers with higher credit limits tend to have more open-to-buy credit available.
- Total_Trans_Amt and Total_Trans_Ct (0.81): High transaction counts are closely tied to high transaction amounts.
Moderate Positive Correlations:
- Customer_Age and Months_on_book (0.79): Older customers tend to have longer relationships with the bank.
- Total_Revolving_Bal and Avg_Utilization_Ratio (0.62): A higher revolving balance is associated with higher credit utilization.
Weak Correlations:
- Total_Relationship_Count shows weak negative correlations with transactional variables like Total_Trans_Amt (−0.35) and Total_Trans_Ct (−0.24), indicating that customers with more products may not necessarily have high transactional activity.
- Months_Inactive_12_mon and most variables have weak correlations, suggesting inactivity might not directly depend on demographic or credit-related factors.
Negative Correlations:
- Avg_Utilization_Ratio and Credit_Limit (−0.48): Customers with higher credit limits tend to have lower utilization ratios, as their spending may not fully utilize their available limits.
- Avg_Utilization_Ratio and Avg_Open_To_Buy (−0.54): High utilization means less open-to-buy, which is intuitive.
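Rather than reading pairs off the heatmap by eye, the strongest correlations can be extracted programmatically by masking the lower triangle and sorting. A minimal sketch on synthetic data (in the notebook, `numerical_data` would take the place of `toy`):

```python
import numpy as np
import pandas as pd

# Synthetic frame with one strongly correlated pair (a, b) and one noise column
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'a': x,
    'b': x * 2 + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle (k=1 excludes the diagonal), then sort pairs by strength
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs.head())
```

Applied to the real data, the top of this list would surface Credit_Limit/Avg_Open_To_Buy and Total_Trans_Amt/Total_Trans_Ct, the near-duplicate pairs worth considering for removal before modeling.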
In [48]:
# Pair plot for all numerical columns
sns.pairplot(df, vars= numerical_columns, hue = 'Attrition_Flag',
diag_kind="kde", plot_kws={'alpha': 0.5})
plt.suptitle("Pair Plot of Numerical Columns", y=1.02)
plt.show()